1. INTRODUCTION

The Director of Processing and Marketing of Plantation Products at the Ministry of Agriculture of Indonesia
stated that Indonesia's coffee production in 2020 was 753,491 tons from an area of 1.2 million hectares,
or up to 806 kg/ha. In 2019, production was about 752,511 tons from an area of 1,245,358 ha, or up to
803 kg/ha. Data from the International Coffee Organization (ICO) show that Indonesian coffee consumption
increases yearly [1]. The Indonesian government supports the coffee business and seeks to avoid low price
pressures on coffee farmers [2], [3]. Indonesia's coffee production comprises 72% Robusta, 27% Arabica,
and 1% Liberica. The five main coffee-producing provinces of Indonesia are South Sumatra, Lampung,
North Sumatra, Aceh, and East Java. In most cases, people tend to judge the quality of coffee by its
regional origin. Coffee marketing has its own characteristics, and each coffee bean has a distinctive aroma
according to its region of origin. The government states that there are at least 16 types of Indonesian
coffee in the international market. Among them are Sumatran Gayo Arabica coffee, Bali Kintamani Arabica
coffee, Toraja Arabica coffee, Ijen Raung's Java Arabica coffee, Riau's Rangsang Meranti Liberika coffee,
and Temanggung Robusta coffee. Coffee marketing needs to be planned and well prepared, including the
choice of price, promotion, distribution, and services that satisfy the needs of existing as well as
potential buyers.

In the business world, big data is used to collect customer information from every transaction that
occurs, so it can serve as a tool for preparing a market strategy. The information from big data is used
to identify consumer needs and to determine effective marketing strategies through three critical
components, namely volume, velocity, and variety. Big data contains a large amount of information, and
big data analysis examines large data sets for patterns and correlations that are valuable for running a
business. A supervised learning algorithm model is essential here. In this setting, the machine learns
from the tasks it is given to form an algorithm, and the distinction between classification and clustering
algorithms must be kept clear. Many algorithms can be used to classify states or objects. Generally, if
the predicted score is higher than a set threshold, the case is assigned to the class, and if it is lower,
it is not. For predicting diabetes, three machine learning algorithms are used to form a decision model,
comparing prediction results from naive Bayes, linear regression, and decision tree [4]. A previous
article also reported the impact of flight delays on customer satisfaction using the naive Bayes method,
which gave the worst results in terms of score, accuracy, and error (RMSE) [5]. In the present research,
the classification of coffees using machine learning techniques is developed. The coffee is classified
into four items, and the category results are used in the quantitative variables of the Bayes model to
produce the minimum validation error, with a rate of 48.56%. Technological innovations such as big data
can help companies process rapidly growing, complex data at low cost. Big data can also support
decision-making, innovation, and capability improvement based on knowledge or insights [6]-[9].

In the present research, coffee production and marketing are studied using machine learning technology
with big data analysis (BDA) [10]. The proposed technology model establishes a periodic monitoring system
for farmers' coffee production, guarantees quality and price, and expands coffee shops' interest [11].
Here, the role of algorithms using big data is to study future trends and apply the resulting innovations
to current business events [12]. Networked computers continuously learn new knowledge or insights that
improve analytical skills and can be applied to the research problem [13]. The proposed technology model
is also to be applied to data for which there is no direct information. An unsupervised algorithm does
not receive a labeled training data set, so it must learn from the existing data [14]. In this kind of
algorithm, no supervisor helps determine the performance or whether the resulting output is right or
wrong, given input samples without labels [15]. The unsupervised algorithm uses clustering and
association methods [16], while unlabeled data are processed using probability statistics [17]. The
likelihood of an event is expressed as a probability between 0, for an event that does not happen, and 1,
for an event that is certain to occur [18]. Statistics is closely related to machine learning [19] and is
useful for finding the patterns with the smallest error values, so that the results are accurate.
Furthermore, data mining is used to segment the market into several clusters [20] by grouping data with
similarities to identify groups across the variables of multiple items. Members of one cluster therefore
have more in common with each other than with members of other clusters [21]. The relationship between
coffee production and coffee marketing is analyzed through machine learning modeling supported by BDA,
with the aim of informing coffee marketing strategies according to the interests of coffee consumers.
Online data sourced from websites and processed by intelligent machines allows big data analysis to map,
group, and find trends or market predictions [22], [23].


2. METHOD 

This study addresses the following problems: first, predicting coffee production in Indonesia, focusing
on Arabica coffee, and then classifying the interests of coffee consumers based on distinctive regional
aroma characteristics. In the processing stage, the various data are stored in a database. The data are
cleaned and divided into sets of user transactions, each representing the activity of a different user
during a visit to the site. The supervised learning algorithm starts by grouping the data sets on coffee
production in the coffee-producing Indonesian provinces into training data and testing data.


2.1. Research frame 

In the present case, the proposed technology model is used to establish a monitoring system in the form
of periodic evaluation of farmers' coffee production, to guarantee quality and price, and to expand the
interest of coffee shops based on data grouping, as shown in Figure 1.

Figure 1. Research frame


2.1.1. Data collection 

Data processing starts with collecting the original raw data, which significantly affects the output.
The raw data are assembled from definite and accurate sources so that the results are valid and usable.
In the preparation stage, the raw data are cleaned by separating and filtering them to eliminate
unnecessary and invalid data. The raw data are checked for incorrect, duplicated, miscalculated, or
missing entries and for incomplete information, and data that are not needed are discarded. The cleaned
data are then converted into a suitable form for further processing, which ensures that only the
best-quality data enter the processing unit. This stage produces output data ready to be used as input
for machine learning and big data analysis modeling.
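
The cleaning step described above can be sketched in a few lines of Python. This is only an illustration
under stated assumptions: the record layout, the field names (`province`, `year`, `production_tons`), and
all values are hypothetical and are not taken from the study's data set.

```python
# Hypothetical raw records; field names and values are made up for illustration.
raw_records = [
    {"province": "Lampung", "year": 2019, "production_tons": 100.0},
    {"province": "Lampung", "year": 2019, "production_tons": 100.0},  # duplicate
    {"province": "Aceh", "year": 2019, "production_tons": None},      # missing value
    {"province": "East Java", "year": 2019, "production_tons": -5.0}, # invalid value
    {"province": "Aceh", "year": 2020, "production_tons": 70.0},
]


def clean_records(records):
    """Drop records with missing or negative production, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        value = rec.get("production_tons")
        if value is None or value < 0:
            continue  # discard missing or invalid data
        key = (rec["province"], rec["year"], value)
        if key in seen:
            continue  # discard duplicated data
        seen.add(key)
        cleaned.append(rec)
    return cleaned


cleaned = clean_records(raw_records)
print(len(cleaned), "of", len(raw_records), "records kept")
```

Keeping the rules in one function makes it easy to see which records are discarded and why before the
data reach the modeling stage.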


2.1.2. Product market controlling 

There are four products among analysis, information, economics, control, efficiency, and service. Of 
the many management variables, the best combination can be adapted to the product environment, price, 
promotion, and distribution. 


2.1.3. Data partition 

To analyze raw data using machine learning techniques, the data are first tidied up and features that
can represent the data are sought. After this processing stage, existing techniques are applied to the
data. The processing stage is the most time-consuming step in machine learning experiments, while
training a model may take much less time.


2.1.4. Model 

A machine learning model is developed, and it requires room to learn. Selecting certain values as
hyperparameters (parameters set differently for each training run) allows the model to be retested with
feedback from the validation data, which is in essence also a way of learning. Modeling is done through a
build, train, and evaluate process, and the build step applies the chosen algorithm. The decision tree
method represents a function whose input is a set of attributes with certain values and whose output is a
single result in the form of a class. K-nearest neighbor is an algorithm based on the principle that each
data point generally lies closest to other data points with the same properties.
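
The K-nearest neighbor principle can be illustrated with a minimal sketch. The two-dimensional points,
the labels, and the choice of k=3 are assumptions made for the example; they do not reproduce the
features or settings used in this study.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to the query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
train_labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))
print(knn_predict(train_points, train_labels, (5.1, 5.0)))
```

A query point is assigned the class of the training points it lies closest to, which is the property the
paragraph above describes.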


2.1.5. Manage data 

Managing data is an essential process in data analysis because, through exploration, users can save
time in the analysis and can detect errors in the data such as missing values, outliers, duplication,
encoding problems, noise, and incomplete data. In big data management, the main focus is on storing and
processing data efficiently and securely. Analysis is intended to find new insights. This process uses
analysis, machine learning, artificial intelligence, and visualization to build a model.


2.1.6. Building big data architecture

Big data architecture supports batch and real-time processing of large amounts of data. Large amounts
of data from the data source are stored, and the management layer then receives the data and converts
the best-quality data into a format that can be understood as a data source for big data analysis. The
data are processed by data analytics tools and stored according to the data format.

Bulletin of Electr Eng & Inf, Vol. 11, No. 5, October 2022: 2764-2772
ISSN: 2302-9285


2.1.7. Analysis 

The problem in this study is first to predict coffee production in Indonesia and then to classify the
interests of coffee consumers based on distinctive regional aroma characteristics. The relationship
between coffee production and coffee marketing is analyzed through machine learning modeling supported by
BDA, with the aim of informing coffee marketing strategies according to the interests of coffee consumers.


2.2. Machine learning algorithm 

Machine learning technology can analyze data on its own, without being reprogrammed, by learning from a
combination of statistics, mathematics, and data mining. The supervised algorithm model depends on
training and testing for the task of forming the algorithm. Classification applies logistic regression,
decision trees, and others. Algorithms for numerical prediction (regression) use linear regression,
decision trees, and others. The data sets were derived from Web APIs and web services and collected in
spreadsheets. The collected data are partitioned into three parts: training data, validation data for the
learned model, and test data, namely the data used for prediction [24]. Training data produce the
features that are used as indicators and selected according to the modeling purpose. Validation data
represent the model's performance on data that it has not previously seen, so the results can be used for
the tuning process. Tuning parameters are used to improve the model's performance [25].
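
The three-way partition can be sketched as follows. The 60/20/20 proportions, the fixed random seed, and
the function name `split_dataset` are assumptions for illustration; the source does not state the
proportions that were used.

```python
import random


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the rows and split them into training, validation, and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for prediction
    return train, validation, test


rows = list(range(10))  # stand-in for ten data records
train, validation, test = split_dataset(rows)
print(len(train), len(validation), len(test))
```

The model is fitted on the training part, tuned against the validation part, and only then evaluated on
the test part, so that the test data stay unseen until the final prediction.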


2.3. Big data analysis 

Building big data for analysis starts with preparing data sources, aggregators, stores, and apps. The
data aggregator collects and distributes data [26]. The data store keeps the processing results in a
document-based NoSQL database [27]. The data are then visualized in apps that interact directly with the
user. Big data analysis helps in understanding data by uncovering trends and patterns [28]. Machine
learning accelerates the processing of complex and diverse data by applying decision-making algorithms.
Machine learning can classify or categorize incoming data, identify the patterns that occur, and turn
data into information that contains insights relevant to business interests.

Machine learning forecasts incoming data and acts as an intelligent system that uses past data and past
experience to predict future business. The present research is being developed to classify the
relationship between coffee production and marketing so that business people have a picture of the
resulting performance. This discussion is supported by previous research, such as predicting the future
price of coffee with a comparative analysis of the performance of linear regression, XGB, and LSTM
techniques [29]. A comparison of the related literature is given in Table 1.


Table 1. Related literature comparison

Problem | Solution | Method
Diagnosis and treatment of type diabetes | Predict diabetes and its complications with a literature review | The review used 18 algorithm types from machine learning and deep learning.
Investors and other business people have a desire to know the future price of commodities to make informed business decisions | Predict the future price of a coffee with comparative analysis based on their performance | Using linear regression, XGB, and LSTM technique to build an android application by applying a prediction algorithm.
Agricultural product smart marketing problem and innovation in marketing smart agricultural products | Build an application for the public that will inform the marketing of agricultural products with a smart system | Technological system innovation in agricultural product marketing.


3. RESULTS AND DISCUSSION 

In the present work, the data are obtained from web structures in the form of logs, clickstreams,
cookies, and queries. The coffee market problem is based on the types of coffee most in demand among
coffee lovers, as seen from the coffee interest grouping variables. The initial stage processes data sets
from the input data, produces output data in response, and builds models that use logical calculations to
respond to new data. Supervised learning is used with regression and classification techniques to form
predictive models of coffee production and coffee marketing. Each domain defines each problem
differently. Descriptive and plot-based statistical methods are used to explore the data, fit probability
distributions to the data, generate random numbers for simulation, and test hypotheses.

The decision tree model works with predictor variables and target variables. A decision tree, either a
classification tree or a regression tree, predicts the response to the data to be classified. To predict
a response, the decisions in the tree are followed from the root node down to a leaf node, and each leaf
node holds a response. Each decision tree has two kinds of nodes, decision nodes and leaf nodes. A
decision node makes a decision and has several branches, while a leaf node gives the decision to be taken
and has no further branches.
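
The walk from the root node to a leaf node can be illustrated with a small hand-built tree. The feature
name `average_production` and the thresholds 50 and 100 are hypothetical; they are not the splits
learned by the model in this study.

```python
# Decision nodes hold a feature, a threshold, and two branches; leaf nodes hold only a class.
tree = {
    "feature": "average_production",
    "threshold": 50.0,
    "left": {"leaf": "low"},
    "right": {
        "feature": "average_production",
        "threshold": 100.0,
        "left": {"leaf": "medium"},
        "right": {"leaf": "high"},
    },
}


def predict(node, sample):
    """Follow decision nodes from the root until a leaf node is reached."""
    while "leaf" not in node:
        goes_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["leaf"]


print(predict(tree, {"average_production": 30.0}))
print(predict(tree, {"average_production": 75.0}))
print(predict(tree, {"average_production": 150.0}))
```

Each prediction is simply the class stored in the leaf where the walk ends, which is why a fitted tree
is easy to read and fast to evaluate.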

Regression trees and classification trees belong to the supervised machine learning family, that is,
modeling in which some variables in the data are treated as inputs and others as outputs. A regression
tree handles a numeric output (target) variable, while a classification tree handles a categorical output
(target) variable with two or more categories.

Classification Learner was used to train the models, with predictor X=IdProvince and predictor
Y=Average. Decision trees are easy to interpret, fast for fitting and prediction, and low in memory use,
but they can give poor prediction results. SVM classifies the coffee producer data as well as the coffee
marketing variables. It looks for the best boundary, called a hyperplane, that separates the data points
of one class from those of the other. The best hyperplane for an SVM is the one with the largest margin
between the two classes, as shown in Figure 2.


Figure 2. Model prediction (panels: Predictions: model 1.1 and Predictions: model 2.1; trees model;
axes: IDprovince and average)
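
For reference, the largest-margin hyperplane mentioned for the SVM can be written in the standard
textbook form for two linearly separable classes. The symbols below (weight vector w, offset b, samples
x_i with labels y_i in {-1, +1}) are notation introduced here for illustration and do not appear in the
source.

```latex
\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top}\mathbf{x}_i + b \right) \ge 1, \qquad i = 1, \dots, n,
\qquad \text{with margin width} \quad \frac{2}{\lVert \mathbf{w} \rVert}.
```

Minimizing the norm of w under these constraints is equivalent to maximizing the margin width, so the
solution is the hyperplane that lies farthest from the nearest points of both classes.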


After classification, the performance is calculated using the confusion matrix. A true positive means
that the model predicts positive and the prediction is correct, and a true negative means that the model
predicts negative and the prediction is correct. A false positive is a type 1 error, in which the model
predicts positive and the prediction is wrong, and a false negative is a type 2 error, in which the model
predicts negative and the prediction is wrong. On the coffee production training data set for the last
five years, decision tree modeling is the most accurate, at 85.7%, compared with the other algorithms.
The performance measurement results for the classification problem are shown in Figure 3.

Figure 3. Positive negative rates decision trees
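
The four confusion matrix counts and the rates derived from them can be computed as in the sketch below.
The label lists are invented for the example and are unrelated to the 85.7% accuracy reported above.

```python
def confusion_counts(actual, predicted, positive):
    """Count true positives, false positives, true negatives, and false negatives."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1  # type 1 error: predicted positive, actually negative
        elif a == positive:
            fn += 1  # type 2 error: predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, tn, fn


# Hypothetical labels for seven samples.
actual = ["high", "high", "low", "low", "high", "low", "low"]
predicted = ["high", "low", "low", "low", "high", "high", "low"]

tp, fp, tn, fn = confusion_counts(actual, predicted, positive="high")
accuracy = (tp + tn) / (tp + fp + tn + fn)
true_positive_rate = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)
print(tp, fp, tn, fn)
print(round(accuracy, 3), round(true_positive_rate, 3), round(false_positive_rate, 3))
```

Accuracy summarizes all four counts in one number, while the true positive and false positive rates show
how the errors are distributed between the two classes.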


The margin is the maximal width of the slab parallel to the separating line that contains no interior
data points. Nearest-neighbor classification usually gives good prediction accuracy in low dimensions
but may not do so in high dimensions. For coffee production modeling, algorithm performance measurement
is carried out to determine which algorithm performs best.

The F1-score is used to measure the performance of each algorithm. Once the accuracy of each algorithm
is known, the performance test is carried out. The model is tested on the data set with K-fold
cross-validation, which evaluates classification performance with the number of folds set to 5. In K-fold
cross-validation, the data set is randomly divided into K partitions; the model is trained on the
training partitions, and the held-out partition is used as test data for validation. To further confirm
the accuracy on the data set, the models are then tested using the AUC of the ROC curve, as shown in
Figure 4.
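
A minimal sketch of 5-fold partitioning and of the F1-score is given below. The helper names
`k_fold_indices` and `f1_score`, the fixed seed, and the sample count are assumptions for illustration;
the study itself does not list its implementation.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Randomly partition sample indices into k folds and return (training, held_out) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        held_out = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, held_out))
    return splits


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


splits = k_fold_indices(10, k=5)
for training, held_out in splits:
    print(sorted(held_out), len(training))
print(round(f1_score(tp=2, fp=1, fn=1), 3))
```

Every sample is held out exactly once, so averaging the score over the five folds uses all of the data
for both training and validation.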

AUC is the area under the ROC curve inside the unit square, so its value always lies between 0 and 1.
The AUC value is 0.97 for the positive high-class production, 0.92 for the positive medium-class
production, and 0.91 for the positive low-class production. Random performance produces an AUC value of
0.5 because its ROC curve is the diagonal line from the point (0,0) to the point (1,1). If the resulting
AUC<0.5, the statistical model being evaluated has very low accuracy, and its predictions are of little
use. The AUC value for the positive class Robusta is 0.94, while the AUC value for the positive class
Arabica is 0.96. The validation of the classification modeling is given in Table 2.
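
The AUC can be computed from classifier scores by tracing the ROC curve and summing trapezoids, as in
the sketch below. The scores and labels are invented for the example, and the code assumes that both
classes are present and that no two samples share the same score.

```python
def roc_points(scores, labels):
    """Return ROC points (false positive rate, true positive rate), highest score first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for the positive class and the true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(round(auc(points), 3))
print(auc([(0.0, 0.0), (1.0, 1.0)]))  # the diagonal of a random classifier
```

The diagonal from (0,0) to (1,1) gives an area of 0.5, which is why 0.5 is the reference value for
random performance.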

Accuracy measurements were carried out to review how accurate the grouping results are with respect to
the existing data and the classification purposes. The relationship between production and marketing for
Arabica coffee and Robusta coffee showed that each class had high accuracy. Big data analysis using
machine learning algorithm modeling supports fast decision-making. Descriptive big data analysis can
classify coffee production by province as well as coffee specialties by regional coffee type.
Prescriptive big data analysis helps make better decisions and anticipates what will happen using
optimization and mathematical simulations. The application of machine learning and big data analysis in
the coffee business has continued to develop recently, especially for the classification of coffee beans
aimed at predicting the quality of coffee beans, the roasting process, and the quality of the coffee
produced [30]-[33].

These results are important not only in coffee business research but also in other fields such as
agriculture and food. The present results also support previous results and provide more information on
the relationship between coffee production and coffee marketing, which is important for coffee marketing
strategies, and on the mapping of coffee bean varieties, which is important for agricultural research.


Figure 4. Class productions curve (ROC curves of true positive rate versus false positive rate for high
class production, AUC = 0.97; medium class production, AUC = 0.92; and low class production, AUC = 0.91)


Table 2. Validation type model

Type model | Class | AUC | ROC | Level of accuracy
Decision trees | High product | 0.97 | 0.00-0.97 | High
Decision trees | Medium product | 0.92 | 0.12-0.89 | High
Decision trees | Low product | 0.91 | 0.17-0.91 | High
Logistic regression | Arabica coffee | 0.94 | 0.00-0.86 | High
Logistic regression | Robusta coffee | 0.92 | 0.14-1.00 | High

4. CONCLUSION 
The present research shows that local coffee production by province can be predicted as high, medium,
or low from a big data analysis of coffee production over the last 5 years. The ROC curves for the
classification of coffee production and local coffee marketing show that the three categories of coffee
production and the two types of coffee specialization are classified correctly. Future research should
collect more data sets on the types of coffee throughout Indonesia so that the relationship between
coffee production and coffee marketing can be determined more precisely.